Abstract



We consider the task of learning a classifier from the feature space $\mathcal{X}$ to the set of
classes $\mathcal{Y} = \{0, 1\}$, when the features can be partitioned into class-conditionally
independent feature sets $\mathcal{X}_1$ and $\mathcal{X}_2$. We show the surprising fact that the
class-conditional independence can be used to represent the original learning task in
terms of 1) learning a classifier from $\mathcal{X}_2$ to $\mathcal{X}_1$ and 2) learning the
class-conditional distribution of the feature set $\mathcal{X}_1$. This fact can be exploited for
semi-supervised learning because the former task can be accomplished purely from unlabeled
samples. We present experimental evaluation of the idea in two real world applications.



1 Introduction 

Semi-supervised learning is said to occur when the learner exploits (a presumably large quantity of) 
unlabeled data to supplement a relatively small labeled sample, for accurate induction. The high
cost of labeled data and the simultaneous plenitude of unlabeled data in many application domains
have led to considerable interest in semi-supervised learning in recent years.

We show a somewhat surprising consequence of class-conditional feature independence that leads to 
a simple semi-supervised learning algorithm. When the feature set can be partitioned into two class- 
conditionally independent sets, we show that the original learning problem can be reformulated in 
terms of the problem of learning a predictor from one of the partitions to the other. That is, the latter 
partition acts as a surrogate for the class variable. Since such a predictor can be learned from only 
unlabeled samples, an effective semi-supervised algorithm results. 

In the next section we present the simple yet interesting result on which our semi-supervised learning 
algorithm (which we call surrogate learning) is based. We present examples to clarify the intuition 
behind the approach and present a special case of our approach that is used in the applications
section. We then examine related ideas in previous work and situate our algorithm among previous
approaches to semi-supervised learning. We present empirical evaluation on two real world
applications where the required assumptions of our algorithm are satisfied.



2 Surrogate Learning 

We consider the problem of learning a classifier from the feature space $\mathcal{X}$ to the set of
classes $\mathcal{Y} = \{0, 1\}$. Let the features be partitioned into
$\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2$. The random feature vector $x \in \mathcal{X}$ will
be represented correspondingly as $x = (x_1, x_2)$. Since we restrict our consideration to a two-class
problem, the construction of the classifier involves the estimation of the probability
$P(y = 0 \mid x_1, x_2)$ at every point $(x_1, x_2) \in \mathcal{X}$.

We make the following assumptions on the joint probabilities of the classes and features.

1. $P(x_1, x_2 \mid y) = P(x_1 \mid y) P(x_2 \mid y)$ for $y \in \{0, 1\}$. That is, the feature sets
$x_1$ and $x_2$ are class-conditionally independent for both classes. Note that in general our
assumption is less restrictive than the Naive Bayes assumption.

2. $P(x_1 \mid x_2) \neq 0$, $P(x_1 \mid y) \neq 0$ and $P(x_1 \mid y = 0) \neq P(x_1 \mid y = 1)$.
These assumptions are to avoid divide-by-zero problems in the algebra below. If $x_1$ is a discrete
valued random variable and not irrelevant for the classification task, these conditions are often
satisfied.

Under these assumptions, surprisingly, we can establish that $P(y = 0 \mid x_1, x_2)$ can be written
as a function of $P(x_1 \mid x_2)$ and $P(x_1 \mid y)$. First, when we consider the quantity
$P(y, x_1 \mid x_2)$, we may derive the following.



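$$
\begin{aligned}
& P(y, x_1 \mid x_2) = P(x_1 \mid y, x_2)\, P(y \mid x_2) \\
\Longrightarrow\ & P(y, x_1 \mid x_2) = P(x_1 \mid y)\, P(y \mid x_2) \quad \text{(from the independence assumption)} \\
\Longrightarrow\ & P(y \mid x_1, x_2)\, P(x_1 \mid x_2) = P(x_1 \mid y)\, P(y \mid x_2) \\
\Longrightarrow\ & \frac{P(y \mid x_1, x_2)\, P(x_1 \mid x_2)}{P(x_1 \mid y)} = P(y \mid x_2)
\end{aligned}
\tag{1}
$$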

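Since $P(y = 0 \mid x_2) + P(y = 1 \mid x_2) = 1$, Equation 1 implies

$$
\begin{aligned}
& \frac{P(y = 0 \mid x_1, x_2)\, P(x_1 \mid x_2)}{P(x_1 \mid y = 0)} + \frac{P(y = 1 \mid x_1, x_2)\, P(x_1 \mid x_2)}{P(x_1 \mid y = 1)} = 1 \\
\Longrightarrow\ & \frac{P(y = 0 \mid x_1, x_2)\, P(x_1 \mid x_2)}{P(x_1 \mid y = 0)} + \frac{\left(1 - P(y = 0 \mid x_1, x_2)\right) P(x_1 \mid x_2)}{P(x_1 \mid y = 1)} = 1
\end{aligned}
\tag{2}
$$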

Solving Equation 2 for $P(y = 0 \mid x_1, x_2)$, we obtain

$$
P(y = 0 \mid x_1, x_2) = \frac{P(x_1 \mid y = 0)}{P(x_1 \mid x_2)} \cdot \frac{P(x_1 \mid y = 1) - P(x_1 \mid x_2)}{P(x_1 \mid y = 1) - P(x_1 \mid y = 0)}
\tag{3}
$$

We have succeeded in writing $P(y = 0 \mid x_1, x_2)$ as a function of $P(x_1 \mid x_2)$ and
$P(x_1 \mid y)$. This leads to a significant simplification of the learning task when a large amount
of unlabeled data is available, especially if $x_1$ is finite valued. The learning algorithm involves
the following two steps.

• Estimate the quantity $P(x_1 \mid x_2)$ from only the unlabeled data, by building a predictor from
the feature space $\mathcal{X}_2$ to the space $\mathcal{X}_1$. There is no restriction on the
learning algorithm for this prediction task.

• Estimate the quantity $P(x_1 \mid y)$ from a smaller labeled sample by counting.
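
The way the two estimated quantities combine can be checked numerically. The following is a minimal sketch, not the authors' implementation: it builds a small hypothetical discrete distribution that satisfies the class-conditional independence assumption (every probability below is an illustrative assumption) and confirms that the expression for $P(y = 0 \mid x_1, x_2)$ derived above reproduces the posterior computed directly from the full joint. In practice $P(x_1 \mid x_2)$ and $P(x_1 \mid y)$ would be estimated from data rather than read off a known joint.

```python
from itertools import product


def surrogate_posterior(p_x1_given_x2, p_x1_given_y0, p_x1_given_y1):
    """P(y=0 | x1, x2) written in terms of P(x1 | x2) and P(x1 | y)."""
    return (p_x1_given_y0 / p_x1_given_x2) * (
        (p_x1_given_y1 - p_x1_given_x2) / (p_x1_given_y1 - p_x1_given_y0)
    )


# Hypothetical toy distribution; all numbers are illustrative assumptions.
P_Y = {0: 0.5, 1: 0.5}
P_X1_GIVEN_Y = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.2, 1: 0.8}}  # [y][x1]
P_X2_GIVEN_Y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}  # [y][x2]


def joint(x1, x2, y):
    # Class-conditional independence: P(x1, x2 | y) = P(x1 | y) P(x2 | y).
    return P_Y[y] * P_X1_GIVEN_Y[y][x1] * P_X2_GIVEN_Y[y][x2]


def max_abs_error():
    """Largest gap between the direct posterior and the surrogate expression."""
    worst = 0.0
    for x1, x2 in product((0, 1), repeat=2):
        p_x1_x2 = sum(joint(x1, x2, y) for y in (0, 1))
        p_x2 = sum(joint(a, x2, y) for a in (0, 1) for y in (0, 1))
        direct = joint(x1, x2, 0) / p_x1_x2  # Bayes rule on the full joint
        via_surrogate = surrogate_posterior(
            p_x1_x2 / p_x2, P_X1_GIVEN_Y[0][x1], P_X1_GIVEN_Y[1][x1]
        )
        worst = max(worst, abs(direct - via_surrogate))
    return worst
```

Calling `max_abs_error()` on this toy joint returns a value at floating-point rounding level, which is what the derivation predicts when the independence assumption holds exactly.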

Thus, we can decouple the prediction problem into two separate tasks, one of which involves
predicting $x_1$ from the remaining features. In other words, $x_1$ serves as a surrogate for the
class label. Furthermore, for the two steps above there is no necessity for complete samples. All
the labeled examples can have the feature $x_2$ missing.

The following example illustrates the intuition behind surrogate learning. 



Example 1 

Consider a two-class problem, where $x_1$ is a binary feature and $x_2$ is a one dimensional
real-valued feature. The class-conditional distribution of $x_2$ for the class $y = 0$ is Gaussian,
and for the class $y = 1$ is Laplacian as shown in Figure 1A.

Because of the class-conditional feature independence assumption, the joint distribution
$P(x_1, x_2, y)$ can now be completely specified by fixing the joint probability $P(x_1, y)$. Let
$P(x_1 = 0, y = 0) = 0.3$, $P(x_1 = 0, y = 1) = 0.1$, $P(x_1 = 1, y = 0) = 0.2$, and
$P(x_1 = 1, y = 1) = 0.4$. The full joint distribution is depicted in Figure 1B. Also shown in
Figure 1B are the conditional distributions $P(x_1 = 0 \mid x_2)$ and $P(y = 0 \mid x_1, x_2)$.

Assume that we have a classifier to decide between $x_1 = 0$ and $x_1 = 1$ from the feature $x_2$. If
this classifier is used to classify a sample that is from class $y = 0$, it will most likely be assigned
the 'label' $x_1 = 0$ (because, for class $y = 0$, $x_1 = 0$ is more likely than $x_1 = 1$), and a sample
that is from class $y = 1$ is often assigned the 'label' $x_1 = 1$. Consequently the classifier between
$x_1 = 0$ and $x_1 = 1$ provides information about the true class label $y$. This can also be seen in the
similarity between the curves $P(y = 0 \mid x_1, x_2)$ and the curve $P(x_1 \mid x_2)$.

Figure 1: A) Class-conditional probability distributions of the feature $x_2$, B) the joint distributions
and the posterior distributions of the class $y$ and the surrogate class $x_1$.
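
Example 1 can be reproduced with a few lines of code. This is a sketch under stated assumptions: the joint $P(x_1, y)$ is taken from the example, but the text does not give the parameters of the Gaussian and Laplacian densities, so the location and scale values below are hypothetical choices made only for illustration. The sketch computes $P(x_1 \mid x_2)$ and compares the posterior obtained directly from the joint with the one obtained through the surrogate expression.

```python
import math

# Joint P(x1, y) from Example 1; keys are (x1, y).
P_X1_Y = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}


def gaussian_pdf(x, mean=-1.0, std=1.0):
    # p(x2 | y=0); mean and std are assumed, not taken from the paper.
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def laplace_pdf(x, loc=1.0, scale=1.0):
    # p(x2 | y=1); loc and scale are assumed, not taken from the paper.
    return math.exp(-abs(x - loc) / scale) / (2.0 * scale)


def p_y(y):
    return P_X1_Y[(0, y)] + P_X1_Y[(1, y)]


def p_x1_given_y(x1, y):
    return P_X1_Y[(x1, y)] / p_y(y)


def p_y0_given_x2(x2):
    a = p_y(0) * gaussian_pdf(x2)
    b = p_y(1) * laplace_pdf(x2)
    return a / (a + b)


def p_x1_given_x2(x1, x2):
    # Mixture over the class, using class-conditional independence.
    q = p_y0_given_x2(x2)
    return p_x1_given_y(x1, 0) * q + p_x1_given_y(x1, 1) * (1.0 - q)


def p_y0_direct(x1, x2):
    # Posterior from the full joint P(x1, y) p(x2 | y).
    a = P_X1_Y[(x1, 0)] * gaussian_pdf(x2)
    b = P_X1_Y[(x1, 1)] * laplace_pdf(x2)
    return a / (a + b)


def p_y0_surrogate(x1, x2):
    # Posterior from P(x1 | x2) and P(x1 | y) only.
    p0, p1 = p_x1_given_y(x1, 0), p_x1_given_y(x1, 1)
    q = p_x1_given_x2(x1, x2)
    return (p0 / q) * (p1 - q) / (p1 - p0)
```

With these assumed densities, `p_y0_direct(x1, x2)` and `p_y0_surrogate(x1, x2)` agree at every point, and `p_x1_given_x2(0, x2)` is larger where class $y = 0$ dominates, which is the sense in which the surrogate classifier carries information about $y$.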

2.1 A Special Case 

We now specialize the above setting of the classification problem to the one realized in the
applications we present later. We still wish to learn a classifier from
$\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2$ to the set of classes $\mathcal{Y} = \{0, 1\}$.
We make the following assumptions.

1. $x_1$ is a binary random variable. That is, $\mathcal{X}_1 = \{0, 1\}$.

2. $P(x_1, x_2 \mid y = 0) = P(x_1 \mid y = 0) P(x_2 \mid y = 0)$. We require that the feature $x_1$ be
class-conditionally independent of the remaining features only for the class $y = 0$.

3. $P(x_1 = 0, y = 1) = 0$. This assumption says that $x_1$ is a '100% recall' feature for $y = 1$.¹

Assumption 3 simplifies the learning task to the estimation of the probability
$P(y = 0 \mid x_1 = 1, x_2)$ for every point $x_2 \in \mathcal{X}_2$. We can proceed as before to
obtain the expression in Equation 3.



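$$
\begin{aligned}
P(y = 0 \mid x_1 = 1, x_2)
&= \frac{P(x_1 = 1 \mid y = 0)}{P(x_1 = 1 \mid x_2)} \cdot \frac{P(x_1 = 1 \mid y = 1) - P(x_1 = 1 \mid x_2)}{P(x_1 = 1 \mid y = 1) - P(x_1 = 1 \mid y = 0)} \\
&= \frac{P(x_1 = 1 \mid y = 0)}{P(x_1 = 1 \mid x_2)} \cdot \frac{1 - P(x_1 = 1 \mid x_2)}{1 - P(x_1 = 1 \mid y = 0)} \\
&= \frac{P(x_1 = 1 \mid y = 0)}{P(x_1 = 0 \mid y = 0)} \cdot \frac{P(x_1 = 0 \mid x_2)}{1 - P(x_1 = 0 \mid x_2)}
\end{aligned}
\tag{4}
$$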



Equation 4 shows that $P(y = 0 \mid x_1 = 1, x_2)$ is a monotonically increasing function of
$P(x_1 = 0 \mid x_2)$. This means that after we build a predictor from $\mathcal{X}_2$ to
$\mathcal{X}_1$, we only need to establish the threshold on $P(x_1 = 0 \mid x_2)$ to yield the
optimum classification between $y = 0$ and $y = 1$. Therefore the learning proceeds as follows.

• Estimate the quantity $P(x_1 \mid x_2)$ from only the unlabeled data, by building a predictor from
the feature space $\mathcal{X}_2$ to the binary space $\mathcal{X}_1$. Again, there is no restriction
on this prediction algorithm.

• Use a small labeled sample to establish the threshold on $P(x_1 = 0 \mid x_2)$.

In the unlabeled data, we call the samples that have $x_1 = 1$ the target samples and those that have
$x_1 = 0$ the background samples. The reason for this terminology is clarified in Example 2.
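
The special case can be sketched in code as follows. The first function is a direct transcription of Equation 4. The second is only one plausible way to "establish the threshold" from a small labeled sample, namely picking the cut that minimizes misclassifications on labeled targets; the text does not specify the selection rule, so this rule and the function names are assumptions.

```python
def posterior_y0(p_x1_0_given_x2, p_x1_1_given_y0):
    """Equation 4: P(y=0 | x1=1, x2) from P(x1=0 | x2) and P(x1=1 | y=0)."""
    class_odds = p_x1_1_given_y0 / (1.0 - p_x1_1_given_y0)
    return class_odds * p_x1_0_given_x2 / (1.0 - p_x1_0_given_x2)


def best_threshold(scores, labels):
    """Choose a cut on P(x1=0 | x2) using labeled target samples.

    scores: predicted P(x1=0 | x2) for labeled samples that have x1 = 1
    labels: true classes y (0 or 1) for the same samples
    A sample is predicted as y=0 when its score is >= the threshold.
    Returns (threshold, number of errors at that threshold).
    """
    best_t, best_err = None, None
    # float("inf") covers the option of predicting y=1 for every sample.
    for t in sorted(set(scores)) + [float("inf")]:
        err = sum((s >= t) != (y == 0) for s, y in zip(scores, labels))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
```

Because Equation 4 is monotone in $P(x_1 = 0 \mid x_2)$, thresholding the predictor output directly is equivalent to thresholding the posterior, so the labeled sample is needed only to place the cut.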



¹ This assumption can be seen to trivially enforce the independence of the features for class $y = 1$.

Example 2



We consider a problem with distributions $P(x_2 \mid y)$ identical to Example 1 (Figure 1A), except with
the joint probability $P(x_1, y)$ given by $P(x_1 = 0, y = 0) = 0.3$, $P(x_1 = 0, y = 1) = 0.0$,
$P(x_1 = 1, y = 0) = 0.2$, and $P(x_1 = 1, y = 1) = 0.5$. The class-and-feature joint distribution is
depicted in Figure 2. Clearly, $x_1$ is a 100% recall feature for $y = 1$.



Note that for a sample from the class $y = 0$ it is impossible to determine, better than random and
by looking at the $x_2$ feature alone, whether it is a target or a background, whereas a sample
from the positive class is always a target. Therefore the background samples serve to delineate the
positive examples among the targets.
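
As a worked instance, the numbers of Example 2 can be substituted into Equation 4. The lines below use only the joint probabilities stated in the example; the last line is a consistency check added here, showing that the resulting posterior never exceeds 1.

```latex
\begin{align*}
P(x_1 = 1 \mid y = 0) &= \frac{0.2}{0.3 + 0.2} = 0.4, \qquad
P(x_1 = 0 \mid y = 0) = \frac{0.3}{0.3 + 0.2} = 0.6, \\
P(y = 0 \mid x_1 = 1, x_2) &= \frac{0.4}{0.6} \cdot \frac{P(x_1 = 0 \mid x_2)}{1 - P(x_1 = 0 \mid x_2)}
  = \frac{2}{3} \cdot \frac{P(x_1 = 0 \mid x_2)}{1 - P(x_1 = 0 \mid x_2)}, \\
P(x_1 = 0 \mid x_2) &= 0.6\, P(y = 0 \mid x_2) \le 0.6
  \;\Longrightarrow\; P(y = 0 \mid x_1 = 1, x_2) \le \frac{2}{3} \cdot \frac{0.6}{0.4} = 1 .
\end{align*}
```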



Figure 2: The joint distributions and the posterior distributions for Example 2.



3 Related Work

Although the idea of using unlabeled data to improve classifier accuracy has been around for several
decades [8], semi-supervised learning has received much attention recently due to impressive results
in some domains. The compilation of chapters edited by Chapelle et al. is an excellent introduction
to the various approaches to semi-supervised learning, and the related practical and theoretical
issues [6].

Identical to our setup, co-training assumes that the features can be split into two class-conditionally
independent sets or 'views' [3]. Also assumed is the sufficiency of either view for accurate
classification. The co-training algorithm iteratively uses the unlabeled data classified with high
confidence by the classifier on one view, to generate labeled data for learning the classifier on the other.

The intuition underlying co-training is that the errors caused by the classifier on one view are
independent of the other view, hence can be conceived as uniform² noise added to the training examples
for the other view. Consequently, the number of label errors in a region in the feature space is
proportional to the number of samples in the region. If the former classifier is reasonably accurate, the
proportionally distributed errors are 'washed out' by the correctly labeled examples for the latter
classifier.

² Whether or not a label is erroneous is independent of the feature values of the latter view.

The main distinction of surrogate learning from co-training is the learning of a predictor from one
view to the other, as opposed to learning predictors from both views to the class label. We can
therefore eliminate the requirement that both views be sufficiently informative for reasonably accurate
prediction. Furthermore, unlike co-training, surrogate learning has no iterative component.

Ando and Zhang propose an algorithm to regularize the hypothesis space by simultaneously
considering multiple classification tasks on the same feature space [1]. They then use their so-called
structural learning algorithm for semi-supervised learning of one classification task, by the artificial
construction of 'related' problems on unlabeled data. This is done by creating problems of predicting
observable features of the data and learning the structural regularization parameters from these
'auxiliary' problems and unlabeled data. More recently in [2] they showed that, with conditionally
independent feature sets, predicting from one set to the other allows the construction of a feature
representation that leads to an effective semi-supervised learning algorithm. Our approach directly
operates on the original feature space and can be viewed as another justification for the algorithm in [1].

Castelli and Cover have studied the relative value of labeled and unlabeled samples for learning in
a specialized setting where the class-conditional feature distributions are identifiable, and can be
estimated from an unlabeled dataset [4, 5]. After the mixture is identified from a large number of
unlabeled samples and the classification boundary is defined, labeled examples are necessary only
to specify the 'orientation' of the boundary, i.e., to assign class labels to the regions in the feature
space. We note the parallel in surrogate learning (cf. Equation 4), where a large amount of unlabeled
data can be used to estimate the 'terrain' ($\propto P(x_1 = 0 \mid x_2)$) and labeled data is necessary
to choose the contour that defines the classification boundary.

Multiple Instance Learning (MIL) is a learning setting where training data is provided as positive and
negative bags of samples [7]. A negative bag contains only negative examples whereas a positive bag
contains at least one positive example. Surrogate learning can be viewed as artificially constructing an
MIL problem, with the targets acting as one positive bag and the backgrounds acting as one negative
bag (Section 2.1). The class-conditional feature independence assumption for class $y = 0$ translates
to the identical and independent distribution of the negative samples in both bags.



4 Two Applications 

We applied surrogate learning to problems in record linkage and natural language processing. We 
will explain below how the learning problems in both the applications can be made to satisfy the 
assumptions in our second (100% recall) setting. 

4.1 Record Linkage 

Record linkage is the process of identification and merging of records of the same entity in different 
databases or the unification of records in a single database, and constitutes an important component 
of data management. The reader is referred to [9] for an overview of the record linkage problem,
strategies and systems. 

Our problem consisted of merging each of ≈ 20000 physician records, which we call the update
database, to the record of the same physician in a master database of ≈ 10^ records. The update
database has fields that are absent in the master database and vice versa. The fields in common 
include the name (first, last and middle initial), several address fields, phone, specialty, and the 
year-of-graduation. Although the last name and year-of-graduation are consistent when present, the 
address, specialty and phone fields have several inconsistencies owing to different ways of writing 
the address, new addresses, different terms for the same specialty, missing fields, etc. However, 
the name and year alone are insufficient for disambiguation. We had access to ≈ 500 manually
matched update records for training and evaluation (about 40 of these update records were labeled 
as unmatchable due to insufficient information). 

The general approach to record linkage involves two steps: 1) blocking, where a small set of can- 
didate records is retrieved from the master record database, which contains the correct match with 
high probability, and 2) matching, where the fields of the update records are compared to those of 
the candidates for scoring and selecting the match. We performed blocking by querying the master 
record database with the last name from the update record. Matching was done by scoring a fea- 
ture vector of similarities over the various fields. The feature values were either binary (verifying 
the equality of a particular field in the update and a master record) or continuous (some kind of 
normalized string edit distance between fields like street address, first name etc.). 

The surrogate learning solution to our matching problem was set up as follows. We designated
the binary feature of equality of year of graduation³ as the surrogate label $x_1$, and the remaining
features are relegated to $x_2$. The required conditions for surrogate learning are satisfied because
1) in our data it is highly unlikely for two records with different year-of-graduation to belong
to the same physician and 2) if it is known that the update record and a master record belong to
two different physicians, then knowing that they have the same (or different) year-of-graduation
provides no information about the other features. Therefore all the feature vectors with the binary
feature indicating equality of year-of-graduation are targets and the remaining are backgrounds.



³ We believe that the equality of the middle initial would have worked just as well for $x_1$.

Table 1: Precision and Recall for record linkage. The surrogate learning algorithm had access to
none of the manually matched records.

Method       Training proportion   Precision   Recall
Surrogate    (none)                0.96        0.95
Supervised   0.5                   0.96        0.94
Supervised   0.2                   0.96        0.91



First, we used feature vectors obtained from the records in all blocks from all 20000 update records to
estimate the probability $P(x_1 \mid x_2)$. We used logistic regression for this prediction task. For
learning the logistic regression parameters, we discarded the feature vectors for which $x_1$ was
missing and performed mean imputation for the missing values of other features. Second, the
probability $P(x_1 = 1 \mid y = 0)$ (the probability that two different randomly chosen physicians have
the same year of graduation) was estimated straightforwardly from the counts of the different
years-of-graduation in the master record database.
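
One natural way to estimate this probability from counts is sketched below. The text says only that the estimate was obtained "from the counts", so the specific estimator (the fraction of pairs of distinct records that share a year) and the function name are assumptions.

```python
from collections import Counter


def same_year_probability(years):
    """Estimate P(x1=1 | y=0): chance two different physicians share a year.

    years: one year-of-graduation value per record in the master database
    """
    counts = Counter(years)
    n = len(years)
    # Ordered pairs of distinct records with equal years,
    # divided by all ordered pairs of distinct records.
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
```

For example, four records with years 1990, 1990, 1991 and 1992 contain 2 matching ordered pairs out of 12, giving an estimate of 1/6.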

These estimates were used to assign the score $P(y = 1 \mid x_1 = 1, x_2)$ to the records in a block (cf.
Equation 4). The score of 0 is assigned to feature vectors which have $x_1 = 0$. The only caveat is
calculating the score for feature vectors that had missing $x_1$. For such records we assign the score
$P(y = 1 \mid x_2) = P(y = 1 \mid x_1 = 1, x_2) P(x_1 = 1 \mid x_2)$. We have estimates for both
quantities on the right hand side. The highest scoring record in each block was flagged as a match if
it exceeded some appropriate threshold.
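
The scoring rule just described can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the function and parameter names are hypothetical, and the clipping of the score to the interval [0, 1] is an added safeguard for noisy estimates that the text does not mention.

```python
def match_score(x1, p_x1_1_given_x2, p_x1_1_given_y0):
    """Score a candidate pair by P(y=1 | ...), where y=1 means 'same physician'.

    x1: 1 if the years of graduation agree, 0 if they differ, None if missing
    p_x1_1_given_x2: predictor output P(x1=1 | x2), assumed to be nonzero
    p_x1_1_given_y0: P(x1=1 | y=0), estimated from the master database
    """
    if x1 == 0:
        return 0.0  # different years: treated as a non-match
    # Equation 4, written in terms of P(x1=1 | x2).
    p_y0 = (p_x1_1_given_y0 / (1.0 - p_x1_1_given_y0)) * (
        (1.0 - p_x1_1_given_x2) / p_x1_1_given_x2
    )
    p_y1 = min(1.0, max(0.0, 1.0 - p_y0))  # clipping is an added assumption
    if x1 is None:
        # Missing year: P(y=1 | x2) = P(y=1 | x1=1, x2) * P(x1=1 | x2).
        return p_y1 * p_x1_1_given_x2
    return p_y1


def best_match(block, p_x1_1_given_y0, threshold):
    """Return the id of the top-scoring candidate, or None if below threshold.

    block: list of (record_id, x1, p_x1_1_given_x2) tuples for one update record
    """
    scored = [(match_score(x1, q, p_x1_1_given_y0), rid) for rid, x1, q in block]
    if not scored:
        return None
    score, rid = max(scored)
    return rid if score > threshold else None
```

A record with a missing year is thus penalized by the factor $P(x_1 = 1 \mid x_2)$ relative to an otherwise identical record whose year is known to agree.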

We compared the results of the surrogate learning approach to a supervised logistic regression based
matcher which used a portion of the manual matches for training and the remaining for testing.
Table 1 shows the match precision and recall for both the surrogate learning and the supervised
approaches. For the supervised algorithm, we show the results for the case where half the manually
matched records were used for training and half for testing, as well as for the case where a fifth
of the records were used for training and the remaining four-fifths for testing. In the latter case,
every record participated in exactly one training fold but in four test folds.

The results indicate that the surrogate learner performs better matching by exploiting the unlabeled
data than the supervised learner with insufficient training data. The results, although not dramatic,
are still promising, considering that the surrogate learning approach used none of the training records.

4.2 Merger-Acquisition Sentence Classification

Sentence classification is often a preprocessing step for event or relation extraction from text. One 
of the challenges posed by sentence classification is the diversity in the language for expressing the 
same event or relationship. We present a surrogate learning approach for constructing a sentence 
classifier that detects a merger-acquisition (MA) event between two organizations in financial news 
(in other words, we find paraphrases for the MA event). 

We assume that the unlabeled sentence corpus is time-stamped and named entity tagged with
organizations. We further assume that an MA sentence must mention at least two organizations. Our
approach to build the sentence classifier is the following. We first extract all the so-called source
sentences from the corpus that match a few high-precision seed patterns. An example of a seed
pattern used for the MA event is '<ORG1> acquired <ORG2>' (see Example 3 below).

We then extract every sentence in the corpus that contains at least two organizations, such that at 
least one of them matches an organization in the source sentences, and has a time-stamp within a 
two month time window of the matching source sentence. Of this set of sentences, all that contain 
two or more organizations from the same source sentence are designated as target sentences, and the 
rest are designated as background sentences. 
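
The designation of targets and backgrounds can be sketched as a small function. The data layout, the function name, and the reading of the "two month time window" as 60 days on either side of the source sentence are all assumptions made for illustration.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=60)  # assumed reading of the two month window


def designate(sentence_orgs, sentence_date, sources):
    """Return 'target', 'background', or None for one candidate sentence.

    sentence_orgs: set of organization names tagged in the sentence
    sentence_date: datetime.date time-stamp of the sentence
    sources: list of (orgs, date) pairs, one per source sentence
    """
    if len(sentence_orgs) < 2:
        return None  # candidates must mention at least two organizations
    label = None
    for source_orgs, source_date in sources:
        if abs(sentence_date - source_date) > WINDOW:
            continue
        shared = len(sentence_orgs & source_orgs)
        if shared >= 2:
            return "target"  # two or more organizations from the same source
        if shared == 1:
            label = "background"
    return label
```

For instance, with a single hypothetical source `({"US Airways", "Delta"}, date(2006, 11, 15))`, a sentence dated two weeks later that mentions both airlines is a target, while one that mentions US Airways and the New York Stock Exchange is a background.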

We speculate that since an organization is unlikely to have an MA relationship with two different
organizations in the same time period, the backgrounds are unlikely to contain MA sentences, and
moreover the language of the non-MA target sentences is indistinguishable from that of the background
sentences. To relate the approach to surrogate learning, we note that the binary "organization-pair
equality" feature (both organizations in the current sentence being the same as those in a source
sentence) serves as the '100% recall' feature $x_1$. The language in the sentence is the feature set $x_2$.
This setup satisfies the required conditions for surrogate learning because 1) if a sentence is about
MA, the organization pair mentioned in it must be the same as that in a source sentence (i.e., if only
one of the organizations matches those in a source sentence, the sentence is unlikely to be about MA)
and 2) if an unlabeled sentence is non-MA, then knowing whether or not it shares an organization
with a source does not provide any information about the language in the sentence.

We then trained a support vector machine (SVM) classifier to discriminate between the targets and
backgrounds. The feature set ($x_2$) used for this task was a bag of word unigrams, bigrams and
trigrams, generated from the sentences and selected by ranking the n-grams by the divergence of
their distributions in the targets and backgrounds. The sentences were ranked according to the score
assigned by the SVM (which is a proxy for $P(x_1 = 1 \mid x_2)$). This score was then thresholded to
obtain a classification between MA and non-MA sentences.

Example 3 below lists some sentences to illustrate the surrogate learning approach. Note that the
targets may contain both MA and non-MA sentences but the backgrounds are unlikely to be MA.



Example 3 
Seed Pattern 

"offer for <ORG>" 
Source Sentences 

1. <ORG>US Airways<ORG> said Wednesday it will increase its offer for <ORG>Delta<ORG>. 
Target Sentences (SVM score) 

1. <ORG>US Airways<ORG> were to combine with a standalone <ORG>Delta<ORG>. (1.0008563) 

2. <ORG>US Airways<ORG> argued that the nearly $10 billion acquisition of <ORG>Delta<ORG> 
would result in an efficiently run carrier that could offer low fares to fliers. (0.99958149) 

3. <ORG>US Airways<ORG> is asking <ORG>Delta<ORG>'s official creditors committee to support
postponing that hearing. (-0.99914371)

Background Sentences (SVM score) 

1. The cities have made various overtures to <ORG>US Airways<ORG>, including a promise from 
<ORG> America West Airlines<ORG> and the former <ORG>US Airways<ORG>. (0.99957752) 

2. <ORG>US Airways<ORG> shares rose 8 cents to close at $53.35 on the <ORG>New York Stock 
Exchange<ORG>. (-0.99906444) 



We tested our algorithm on an unlabeled corpus of approximately 700000 financial news articles. We 
experimented with five seed patterns (<ORG> acquired <ORG>, <ORG> bought <ORG>, 
offer for <ORG>, to buy <ORG>, merger with <ORG>) which resulted in 870 source sen- 
tences. The participants that were extracted from sources resulted in approximately 12000 target 
sentences and approximately 120000 background sentences. For the purpose of evaluation, 500 ran- 
domly selected sentences from the targets were manually checked leading to 330 being tagged as 
MA and the remaining 170 as non-MA. This corresponds to a 66% precision of the targets. 

We then ranked the targets according to the score assigned by the SVM trained to classify between
the targets and backgrounds, and selected all the targets above a threshold as paraphrases for MA.
Table 2 presents the precision and recall on the 500 manually tagged sentences as the threshold
varies. The results indicate that our approach provides an effective way to rank the target sentences
according to their likelihood of being about MA.
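
The evaluation at a given threshold amounts to the following computation. This is a minimal sketch with hypothetical names, and it assumes that "above a threshold" means strictly greater than the threshold.

```python
def precision_recall(scores, is_ma, threshold):
    """Precision and recall of the rule 'score > threshold means MA'.

    scores: SVM scores of the manually checked target sentences
    is_ma: booleans, True where the sentence was tagged as MA
    """
    selected = [m for s, m in zip(scores, is_ma) if s > threshold]
    true_positives = sum(selected)
    total_positives = sum(is_ma)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall
```

Selecting every target corresponds to a threshold below all scores; on the 500 checked sentences that baseline gives a precision of 330/500 = 0.66 at a recall of 1.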

We also evaluated the capability of the method to find paraphrases by conducting five separate
experiments using each of the five seed patterns individually as the only seed and counting the
number of obtained sentences containing each of the other patterns (using a threshold of 0.0). We
found that the method was effective in finding paraphrases that have very different language than
the sources. We do not provide the numbers due to space considerations.

Table 2: Precision/Recall of surrogate learning on the MA sentence problem for various thresholds.
The baseline of using all the targets as paraphrases for MA has a precision of 66% and a recall of
100%.

Threshold   Precision   Recall
 0.0        0.83        0.94
-0.2        0.82        0.95
-0.8        0.79        0.99

Finally we used the paraphrase sentences, which were found by surrogate learning, to augment the
training data for an MA sentence classifier and evaluated its accuracy. We first built an SVM classifier
only on a portion of the labeled targets and used the remaining as the test set. This approach yielded
an accuracy of 76% on the test set (with two-fold cross validation). We then added all the targets
scored above a threshold by surrogate learning as positive examples (4000 positive sentences in all
were added), and all the backgrounds that scored below a low threshold as negative examples (27000
sentences), to the training data and repeated the two-fold cross validation. The classifier learned on
the augmented training data improved the accuracy on the test data to 86%.

We believe that better designed features (than word n-grams) will provide paraphrases with higher
precision and recall of the MA sentences found by surrogate learning. To apply our approach to a
new event extraction problem, the design step also involves the selection of the $x_1$ feature such that
the targets and backgrounds satisfy our assumptions.

5 Conclusions 

We presented surrogate learning - a simple semi-supervised learning algorithm that can be applied 
when the features satisfy the required independence assumptions. We presented two applications, 
showed how the assumptions are satisfied, and presented empirical evidence for the efficacy of our 
algorithm. We expect that surrogate learning is sufficiently general to be applied in diverse domains, 
if the features are carefully designed. We are developing a version of the algorithm that allows the 
statistical independence assumption to be relaxed to mean independence. 